In the rapidly evolving landscape of financial technology, credit scoring remains a cornerstone in determining the creditworthiness of applicants. This paper delves into the realm of predictive modeling, using a dataset provided by our professors, enriched with comprehensive credit scoring information. The overarching objective of our study is to train a model that efficiently and accurately determines the eligibility of individuals for credit approval.
The task set forth by our professors involves a comprehensive journey through four key exercises. The first exercise is centered around a thorough analysis of the dataset. Here, we delve into understanding the underlying structures, identifying critical variables, and most importantly, balancing the dataset to ensure a fair representation of diverse credit scenarios.
Subsequently, in the second exercise, we focus on training and testing a logistic classifier. This step is crucial in establishing a baseline model from which further improvements can be measured.
The third exercise involves enhancing the predictive performance of the model. This stage is particularly challenging and significant, as it entails fine-tuning the model to ensure it captures the nuances of credit scoring with greater accuracy and efficiency.
Finally, the fourth exercise invites us to step into the practical world. Here, we explore the various challenges a company may face if they were to implement our model in a real-world scenario. This exercise not only grounds our theoretical work in reality but also provides valuable insights into the practical implications and considerations in deploying machine learning models in the credit industry.
This paper aims to not only present a robust model for credit scoring but also to contribute to the broader understanding of how machine learning can be effectively utilized in financial decision-making processes.
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
##
## [[12]]
## NULL
##
## [[13]]
## NULL
##
## [[14]]
## NULL
##
## [[15]]
## NULL
##
## [[16]]
## NULL
##
## [[17]]
## NULL
##
## [[18]]
## NULL
##
## [[19]]
## NULL
##
## [[20]]
## NULL
##
## [[21]]
## NULL
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'reshape2'
##
##
## The following object is masked from 'package:tidyr':
##
## smiths
## Warning: package 'plotly' was built under R version 4.3.1
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
## Warning: package 'lattice' was built under R version 4.3.1
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
##
##
## Attaching package: 'DescTools'
##
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## Warning: package 'pROC' was built under R version 4.3.1
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
##
## Loaded ROSE 0.0-4
##
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
##
## Attaching package: 'PerformanceAnalytics'
##
## The following object is masked from 'package:graphics':
##
## legend
##
##
## Attaching package: 'reshape'
##
## The following object is masked from 'package:plotly':
##
## rename
##
## The following objects are masked from 'package:reshape2':
##
## colsplit, melt, recast
##
## The following object is masked from 'package:lubridate':
##
## stamp
##
## The following object is masked from 'package:dplyr':
##
## rename
##
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
## Warning: package 'knitr' was built under R version 4.3.1
##
## Attaching package: 'dlookr'
##
## The following object is masked from 'package:corrr':
##
## correlate
##
## The following objects are masked from 'package:PerformanceAnalytics':
##
## kurtosis, skewness
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## The following object is masked from 'package:base':
##
## transform
##
## corrplot 0.92 loaded
## [[1]]
## [1] "readr" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "readr" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "readr" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "readr" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "reshape2" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "readr" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[6]]
## [1] "plotly" "reshape2" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "tidyr" "tibble" "ggplot2" "tidyverse" "readr"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[7]]
## [1] "caret" "lattice" "plotly" "reshape2" "lubridate" "forcats"
## [7] "stringr" "dplyr" "purrr" "tidyr" "tibble" "ggplot2"
## [13] "tidyverse" "readr" "stats" "graphics" "grDevices" "utils"
## [19] "datasets" "methods" "base"
##
## [[8]]
## [1] "DescTools" "caret" "lattice" "plotly" "reshape2" "lubridate"
## [7] "forcats" "stringr" "dplyr" "purrr" "tidyr" "tibble"
## [13] "ggplot2" "tidyverse" "readr" "stats" "graphics" "grDevices"
## [19] "utils" "datasets" "methods" "base"
##
## [[9]]
## [1] "pROC" "DescTools" "caret" "lattice" "plotly" "reshape2"
## [7] "lubridate" "forcats" "stringr" "dplyr" "purrr" "tidyr"
## [13] "tibble" "ggplot2" "tidyverse" "readr" "stats" "graphics"
## [19] "grDevices" "utils" "datasets" "methods" "base"
##
## [[10]]
## [1] "ROCR" "pROC" "DescTools" "caret" "lattice" "plotly"
## [7] "reshape2" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [13] "tidyr" "tibble" "ggplot2" "tidyverse" "readr" "stats"
## [19] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[11]]
## [1] "ROSE" "ROCR" "pROC" "DescTools" "caret" "lattice"
## [7] "plotly" "reshape2" "lubridate" "forcats" "stringr" "dplyr"
## [13] "purrr" "tidyr" "tibble" "ggplot2" "tidyverse" "readr"
## [19] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [25] "base"
##
## [[12]]
## [1] "Boruta" "ROSE" "ROCR" "pROC" "DescTools" "caret"
## [7] "lattice" "plotly" "reshape2" "lubridate" "forcats" "stringr"
## [13] "dplyr" "purrr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [19] "readr" "stats" "graphics" "grDevices" "utils" "datasets"
## [25] "methods" "base"
##
## [[13]]
## [1] "RColorBrewer" "Boruta" "ROSE" "ROCR" "pROC"
## [6] "DescTools" "caret" "lattice" "plotly" "reshape2"
## [11] "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [16] "tidyr" "tibble" "ggplot2" "tidyverse" "readr"
## [21] "stats" "graphics" "grDevices" "utils" "datasets"
## [26] "methods" "base"
##
## [[14]]
## [1] "ggcorrplot" "RColorBrewer" "Boruta" "ROSE" "ROCR"
## [6] "pROC" "DescTools" "caret" "lattice" "plotly"
## [11] "reshape2" "lubridate" "forcats" "stringr" "dplyr"
## [16] "purrr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [21] "readr" "stats" "graphics" "grDevices" "utils"
## [26] "datasets" "methods" "base"
##
## [[15]]
## [1] "PerformanceAnalytics" "xts" "zoo"
## [4] "ggcorrplot" "RColorBrewer" "Boruta"
## [7] "ROSE" "ROCR" "pROC"
## [10] "DescTools" "caret" "lattice"
## [13] "plotly" "reshape2" "lubridate"
## [16] "forcats" "stringr" "dplyr"
## [19] "purrr" "tidyr" "tibble"
## [22] "ggplot2" "tidyverse" "readr"
## [25] "stats" "graphics" "grDevices"
## [28] "utils" "datasets" "methods"
## [31] "base"
##
## [[16]]
## [1] "corrr" "PerformanceAnalytics" "xts"
## [4] "zoo" "ggcorrplot" "RColorBrewer"
## [7] "Boruta" "ROSE" "ROCR"
## [10] "pROC" "DescTools" "caret"
## [13] "lattice" "plotly" "reshape2"
## [16] "lubridate" "forcats" "stringr"
## [19] "dplyr" "purrr" "tidyr"
## [22] "tibble" "ggplot2" "tidyverse"
## [25] "readr" "stats" "graphics"
## [28] "grDevices" "utils" "datasets"
## [31] "methods" "base"
##
## [[17]]
## [1] "networkD3" "corrr" "PerformanceAnalytics"
## [4] "xts" "zoo" "ggcorrplot"
## [7] "RColorBrewer" "Boruta" "ROSE"
## [10] "ROCR" "pROC" "DescTools"
## [13] "caret" "lattice" "plotly"
## [16] "reshape2" "lubridate" "forcats"
## [19] "stringr" "dplyr" "purrr"
## [22] "tidyr" "tibble" "ggplot2"
## [25] "tidyverse" "readr" "stats"
## [28] "graphics" "grDevices" "utils"
## [31] "datasets" "methods" "base"
##
## [[18]]
## [1] "reshape" "networkD3" "corrr"
## [4] "PerformanceAnalytics" "xts" "zoo"
## [7] "ggcorrplot" "RColorBrewer" "Boruta"
## [10] "ROSE" "ROCR" "pROC"
## [13] "DescTools" "caret" "lattice"
## [16] "plotly" "reshape2" "lubridate"
## [19] "forcats" "stringr" "dplyr"
## [22] "purrr" "tidyr" "tibble"
## [25] "ggplot2" "tidyverse" "readr"
## [28] "stats" "graphics" "grDevices"
## [31] "utils" "datasets" "methods"
## [34] "base"
##
## [[19]]
## [1] "knitr" "reshape" "networkD3"
## [4] "corrr" "PerformanceAnalytics" "xts"
## [7] "zoo" "ggcorrplot" "RColorBrewer"
## [10] "Boruta" "ROSE" "ROCR"
## [13] "pROC" "DescTools" "caret"
## [16] "lattice" "plotly" "reshape2"
## [19] "lubridate" "forcats" "stringr"
## [22] "dplyr" "purrr" "tidyr"
## [25] "tibble" "ggplot2" "tidyverse"
## [28] "readr" "stats" "graphics"
## [31] "grDevices" "utils" "datasets"
## [34] "methods" "base"
##
## [[20]]
## [1] "dlookr" "knitr" "reshape"
## [4] "networkD3" "corrr" "PerformanceAnalytics"
## [7] "xts" "zoo" "ggcorrplot"
## [10] "RColorBrewer" "Boruta" "ROSE"
## [13] "ROCR" "pROC" "DescTools"
## [16] "caret" "lattice" "plotly"
## [19] "reshape2" "lubridate" "forcats"
## [22] "stringr" "dplyr" "purrr"
## [25] "tidyr" "tibble" "ggplot2"
## [28] "tidyverse" "readr" "stats"
## [31] "graphics" "grDevices" "utils"
## [34] "datasets" "methods" "base"
##
## [[21]]
## [1] "corrplot" "dlookr" "knitr"
## [4] "reshape" "networkD3" "corrr"
## [7] "PerformanceAnalytics" "xts" "zoo"
## [10] "ggcorrplot" "RColorBrewer" "Boruta"
## [13] "ROSE" "ROCR" "pROC"
## [16] "DescTools" "caret" "lattice"
## [19] "plotly" "reshape2" "lubridate"
## [22] "forcats" "stringr" "dplyr"
## [25] "purrr" "tidyr" "tibble"
## [28] "ggplot2" "tidyverse" "readr"
## [31] "stats" "graphics" "grDevices"
## [34] "utils" "datasets" "methods"
## [37] "base"
## Rows: 40000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): grade, home_ownership, verification_status, purpose, application_type
## dbl (12): loan_amnt, int_rate, annual_inc, dti, open_acc, revol_bal, revol_u...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(loan_sample)
## spc_tbl_ [40,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ loan_amnt : num [1:40000] 10000 16000 6000 9000 24000 8000 3600 5550 2000 10000 ...
## $ int_rate : num [1:40000] 11.47 9.49 17.99 9.75 12.59 ...
## $ grade : chr [1:40000] "B" "B" "D" "B" ...
## $ home_ownership : chr [1:40000] "RENT" "MORTGAGE" "RENT" "MORTGAGE" ...
## $ annual_inc : num [1:40000] 35000 110000 40000 54000 66000 ...
## $ verification_status: chr [1:40000] "Verified" "Not Verified" "Verified" "Verified" ...
## $ purpose : chr [1:40000] "debt_consolidation" "debt_consolidation" "home_improvement" "car" ...
## $ dti : num [1:40000] 22.05 15.67 17.34 3.58 12.25 ...
## $ open_acc : num [1:40000] 15 9 8 11 8 10 8 5 8 14 ...
## $ revol_bal : num [1:40000] 10211 10068 3755 1459 29656 ...
## $ revol_util : num [1:40000] 31.5 32.7 25.5 22.8 70.6 34.3 36.1 66.1 24.6 75.6 ...
## $ total_acc : num [1:40000] 17 21 19 34 21 17 9 12 30 21 ...
## $ total_rec_int : num [1:40000] 1574 2364 943 436 2778 ...
## $ application_type : chr [1:40000] "Individual" "Individual" "Individual" "Joint App" ...
## $ tot_cur_bal : num [1:40000] 17440 35104 19995 8713 29656 ...
## $ total_rev_hi_lim : num [1:40000] 32400 30800 14700 6400 42000 28200 21400 6200 17500 30300 ...
## $ Status : num [1:40000] 0 0 0 0 0 0 1 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. loan_amnt = col_double(),
## .. int_rate = col_double(),
## .. grade = col_character(),
## .. home_ownership = col_character(),
## .. annual_inc = col_double(),
## .. verification_status = col_character(),
## .. purpose = col_character(),
## .. dti = col_double(),
## .. open_acc = col_double(),
## .. revol_bal = col_double(),
## .. revol_util = col_double(),
## .. total_acc = col_double(),
## .. total_rec_int = col_double(),
## .. application_type = col_character(),
## .. tot_cur_bal = col_double(),
## .. total_rev_hi_lim = col_double(),
## .. Status = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Checking for NAs
any(is.na(loan_sample))
## [1] FALSE
The result of “FALSE” from the check indicates that there are no missing values (NAs) in the “loan_sample” dataset. So the dataset is provided with full values means that we can work with this dataset without cleaning NA’s.
# First we have to check whether all variables have been imported in the correct format.
first_overview <- overview(loan_sample)
plot(first_overview)
It is clearly visible that the columns with the data type character
still have to be converted to factors. In addition, the “Status” column
is currently still a numeric data type. This must also be converted to
factor, as it is a binary variable. To do further analysis this needs to
be done to get further data with the same structure to use for our model
and analysis.
loan_sample <- loan_sample %>%
mutate_if(is.character, as.factor) %>% # Convert all character columns to factors
mutate(Status = as.factor(Status)) # Convert 'Status' column to factor
# Plot to see, if the datatypes are correct
second_overview <- overview(loan_sample)
plot(second_overview)
That looks right, now we can count the number of variables. As we can
see, there are no anomalies in this bar chart accordingly, we should be
able to proceed with these converted data.
# Count numeric variables in 'loan_sample'
numeric_vars_count <- sum(sapply(loan_sample, is.numeric))
# Count categorical (factor) variables in 'loan_sample'
categorical_vars_count <- sum(sapply(loan_sample, is.factor))
# Print counts of numeric and categorical variables
cat("There are", numeric_vars_count, "numeric variables and", categorical_vars_count, "categorical variables in the dataset.")
## There are 11 numeric variables and 6 categorical variables in the dataset.
categorical_levels <- sapply(loan_sample, function(x) if(is.factor(x)) length(unique(x)) else NA)
# Find the name of the categorical variable with the most levels
cat_var_most_levels <- names(which.max(categorical_levels))
# Find the number of levels for this variable
levels_count <- max(categorical_levels, na.rm = TRUE)
# Print the variable name and the number of levels
cat("The categorical variable with the most levels in the dataset is:", cat_var_most_levels, "with", levels_count, "levels.")
## The categorical variable with the most levels in the dataset is: purpose with 13 levels.
# Categorial Label with the most Levels
summary(loan_sample)
## loan_amnt int_rate grade home_ownership annual_inc
## Min. : 1000 Min. : 5.31 A: 7329 MORTGAGE:17701 Min. : 5000
## 1st Qu.: 7000 1st Qu.: 9.44 B:13166 OWN : 3981 1st Qu.: 42000
## Median :10075 Median :12.29 C:11842 RENT :18318 Median : 57000
## Mean :11687 Mean :12.62 D: 7663 Mean : 63400
## 3rd Qu.:15000 3rd Qu.:15.05 3rd Qu.: 77000
## Max. :40000 Max. :27.49 Max. :400000
##
## verification_status purpose dti
## Not Verified :14373 debt_consolidation:23342 Min. : 0.00
## Source Verified:16116 credit_card : 9362 1st Qu.:12.13
## Verified : 9511 other : 2337 Median :17.60
## home_improvement : 2099 Mean :18.22
## major_purchase : 794 3rd Qu.:23.86
## medical : 444 Max. :60.14
## (Other) : 1622
## open_acc revol_bal revol_util total_acc
## Min. : 1.00 Min. : 0 Min. : 0.00 Min. : 3.00
## 1st Qu.: 8.00 1st Qu.: 5615 1st Qu.: 34.80 1st Qu.:15.00
## Median :10.00 Median : 9818 Median : 52.40 Median :20.00
## Mean :10.31 Mean :11995 Mean : 52.17 Mean :21.27
## 3rd Qu.:13.00 3rd Qu.:15832 3rd Qu.: 70.00 3rd Qu.:27.00
## Max. :23.00 Max. :78762 Max. :121.40 Max. :57.00
##
## total_rec_int application_type tot_cur_bal total_rev_hi_lim
## Min. : 0.0 Individual:39450 Min. : 0 Min. : 300
## 1st Qu.: 673.2 Joint App : 550 1st Qu.: 25136 1st Qu.: 13000
## Median :1342.5 Median : 53722 Median : 20900
## Mean :1818.5 Mean : 99100 Mean : 24194
## 3rd Qu.:2432.9 3rd Qu.:157834 3rd Qu.: 32200
## Max. :8834.9 Max. :472573 Max. :100000
##
## Status
## 0:34810
## 1: 5190
##
##
##
##
##
Our dataset contains information from 40,000 loans, giving us a detailed look into how people handle their finances. On average, folks are requesting loans around $11,687, but it’s interesting to note that the range is pretty wide, going from as little as $1,000 to as much as $40,000. This tells us that people have different needs when it comes to borrowing money.
Now, when it comes to interest rates, the average is about 12.62%, but it’s quite a spread. Some lucky ones are scoring rates as low as 5.31%, while others are dealing with rates as high as 27.49%. It’s a reminder that what you end up with can vary quite a bit.
On the income side of things, the average yearly income for loan applicants is $63,400. But, as expected, there’s a range here too. Some folks are pulling in more, and some are pulling in less. This is just a snapshot of the diversity in people’s financial situations.
We also took a look at the debt-to-income ratio (DTI), which gives us an idea of how much of a person’s income is tied up in paying off debt. On average, it’s sitting at 18.22%. It’s a useful measure for understanding how comfortably people can manage their debt.
Switching gears to credit lines, on average, people have around 10 open credit lines. But, if we look back over time, some have had as many as 21. This gives us insight into how people handle credit – how many cards or loans they’ve had in the past.
The average amount owed on these credit lines is $11,995, and people tend to use about half of their available credit. This shows that folks are using their credit but not maxing it out completely.
When it comes to interest payments, the average is around $1,818. That’s extra money on top of the initial loan amount. Looking at the big picture, the total average outstanding debt across all loans is $99,100. This tells us that, collectively, there’s a significant amount of money being borrowed.
Finally, the average credit limit on people’s credit lines is $24,194. This is the maximum amount the bank allows them to borrow. It’s a cap on their potential debt.
So, diving into these numbers not only gives us a sense of individual financial situations but also paints a broader picture of how people navigate their finances through loans.
# Plot the distribution of the target variable
ggplot(loan_sample, aes(x = Status, fill = Status)) +
geom_bar() +
scale_fill_brewer(palette = "Set1") +
labs(title = 'Distribution of target variable (Status)', x = 'Status', y = 'Count') +
theme_minimal()
The ‘Status’ variable in our dataset serves as a crucial indicator of
credit approval outcomes. It distinctly classifies these outcomes into
two levels: ‘0’ represents a ‘No,’ indicating that the credit has not
been approved, while ‘1’ signifies a ‘Yes,’ confirming the approval of
the credit.
Upon closer inspection, our dataset reveals a notable imbalance in the distribution of these outcomes. Specifically, there is a significantly higher number of instances where credits were not approved (Status ‘0’) compared to those that were approved (Status ‘1’). This imbalance is a critical aspect to consider, especially in the context of machine learning model development.
In machine learning, an unbalanced dataset can introduce bias into the predictive model. In our case, the imbalance may lead the model to overemphasize the majority class (non-approved credits), potentially resulting in a classifier that tends to predict a ‘No’ more frequently.
To mitigate this issue, it becomes essential to address the class imbalance before training our predictive model. Balancing the dataset involves techniques such as oversampling the minority class (approved credits), undersampling the majority class (non-approved credits), or using more advanced methods like synthetic data generation. By doing so, we aim to create a more equitable representation of both outcomes in our training data.
# Histograms
loan_sample %>%
select_if(is.numeric) %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~variable, scales = 'free_x') +
theme_minimal()
The histogram we created for the variable open_acc has gaps. We should examine this more closely.
# Create boxplot of the variable total_acc
ggplot(loan_sample, aes(x = open_acc)) +
geom_histogram(bins = 22) + # Adjust the number of bins as necessary
labs(x = "Open Accounts", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Since the variable ranges from 1 to 22, using 30 bins leads to empty
spaces. Adjusting the bins to match the actual value range has provided
a clearer, gap-free histogram.
# Boxplots
loan_sample %>%
select_if(is.numeric) %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = 'free') +
theme_minimal()
Elaborate your view on how to proceed in dealing with the outliers and – if necessary – take appropriate action.
# Reshape the data to long format and scale the numeric values
loan_sample_long_scaled <- loan_sample %>%
select(where(is.numeric)) %>%
mutate(across(everything(), scale)) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value")
# Create a boxplot with the scaled values
ggplot(loan_sample_long_scaled, aes(x = variable, y = value)) +
geom_boxplot() +
labs(x = "", y = "Scaled Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
The box plot effectively captures the distribution of scaled numeric variables, and the presence of outliers is evident as individual points outside the box plot. Notably, several variables exhibit outliers, indicating the potential presence of extreme values in our dataset.
To gain a deeper understanding, it is essential to contextualize these outliers. Are they genuine extreme values reflective of the population, or do they stem from errors in data collection or measurement discrepancies? This distinction is crucial in determining the appropriate course of action.
Understanding the impact of outliers on our analysis is paramount. Outliers can significantly influence statistical measures, potentially skewing measures of central tendency and affecting the overall spread of the data. For instance, they may impact the accuracy of mean values and standard deviations.
A closer look at the box plot reveals specific variables that seem to have more pronounced outliers. Exploring these outliers in the context of the variables they pertain to could reveal valuable insights and guide subsequent steps in the analysis.
Various statistical techniques, such as calculating z-scores or leveraging the interquartile range (IQR), can systematically identify outliers. Implementing these techniques allows for a more rigorous and objective assessment of outlier presence.
When deciding how to address outliers, removal or correction is a common approach. However, alternative strategies exist, including transforming the data, winsorizing (capping extreme values), or stratifying analyses based on the presence of outliers. Each strategy comes with its considerations and potential implications for the overall analysis.
In summary, recognizing and addressing outliers is a nuanced process that requires careful consideration of their nature, impact, and appropriate handling strategies to ensure the integrity of our analysis.
# Diagnosing outliers and selecting variables with more than 1% outliers
variables_with_high_outlier_ratio <- loan_sample %>%
diagnose_outlier() %>%
filter(outliers_ratio > 1) %>%
pull(variables)
# Plotting the outliers for these variables
loan_sample %>%
plot_outlier(one_of(variables_with_high_outlier_ratio))
As we worked on refining our model, we initially thought removing more
outliers would make it better, so we decided to keep 95% of our data and
discard the top 5% as outliers. Surprisingly, this move didn’t improve
the model; in fact, it made it less precise.
Realizing that less might be more in this case, we changed our approach. Instead of removing 5%, we opted for a more cautious strategy, cutting out only 1% of outliers. This way, we’re holding onto 99% of our data, hoping to strike a balance between cleaning up noisy data and keeping enough information for the model to learn effectively.
This back-and-forth shows that finding the right balance in handling outliers is a bit of a puzzle, and we’re trying to crack it to make our model as accurate and reliable as possible.
# Function for adjusting the values to the quantile limits
cap_values_at_quantiles <- function(x) {
quantiles <- quantile(x, c(0.01, 0.99), na.rm = TRUE)
x <- ifelse(x < quantiles[1], quantiles[1], x)
x <- ifelse(x > quantiles[2], quantiles[2], x)
x
}
loan_sample_truncated <- loan_sample %>%
mutate(across(where(is.numeric), cap_values_at_quantiles))
loan_sample_balanced <- ovun.sample(Status ~ ., data=loan_sample_truncated, method = "under")
loan_sample_under <- data.frame(loan_sample_balanced[["data"]])
# Plot the balance
ggplot(loan_sample_under, aes(x = Status, fill = Status, group = Status)) +
geom_bar() +
ylab("Count") +
xlab("Status of the loan") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
ggtitle("New balanced dataset")
Balancing the dataset is key for training models, especially when
dealing with imbalanced classes like predicting loan default. This
ensures a fair representation of both outcomes, preventing the model
from favoring the more common class. By striking this balance, we aim to
improve the model’s accuracy and fairness in predicting both common and
less frequent events, contributing to a more reliable and equitable
performance.
(i.e. default vs non-default). Discuss the visualizations. Which variables seem to be relevant in predicting the target feature?
# Define columns to exclude (these are column with characters)
exclude_columns <- c(3, 4, 6, 7, 14)
# Loop through columns, excluding the specified ones
for (i in setdiff(1:ncol(loan_sample_under), exclude_columns)) {
# Only plot numeric columns
if (is.numeric(loan_sample_under[[i]])) {
print(ggplot(loan_sample_under, aes_string(y = names(loan_sample_under)[i], color = "Status")) +
geom_boxplot() +
ylab(names(loan_sample_under[i])) +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()))
}
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
While looking at the boxplots, it’s clear that some variables don’t show much difference between the ‘Status’ ‘0’ and ‘1’ categories. Notably, interest rate, annual income, total revenue, and DTI stand out with significant differences between the two statuses. However, manually creating each boxplot can be time-consuming, and interpreting them might lack a systematic approach.
To tackle this, we decided on a more comprehensive method by combining manual boxplot analysis with insights from the Boruta algorithm. This hybrid approach blends both manual examination and algorithmic assistance, aiming for a nuanced understanding of the data and highlighting potential correlations. By doing this, we aim for a well-rounded outcome, drawing on the strengths of both manual and automated analyses to gain a more robust understanding of the dataset’s relationships.
This hybrid model allows for a more informed interpretation, ensuring that the observed deviations in specific variables are thoroughly considered within the broader dataset context. It represents a holistic and balanced approach, bringing together human expertise and algorithmic rigor to deepen our understanding of the correlation patterns in our data.
In the next step, we run the Boruta algorithm.The Boruta algorithm provides a more objective method for determining the importance of features as it is based on repeated, random comparisons rather than human judgment.
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## 11. run of importance source...
## After 11 iterations, +23 secs:
## confirmed 12 attributes: annual_inc, dti, grade, home_ownership, int_rate and 7 more;
## rejected 1 attribute: application_type;
## still have 3 attributes left.
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## 15. run of importance source...
## After 15 iterations, +31 secs:
## confirmed 2 attributes: open_acc, verification_status;
## still have 1 attribute left.
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## 20. run of importance source...
## 21. run of importance source...
## 22. run of importance source...
## 23. run of importance source...
## 24. run of importance source...
## 25. run of importance source...
## 26. run of importance source...
## 27. run of importance source...
## 28. run of importance source...
## 29. run of importance source...
## 30. run of importance source...
## 31. run of importance source...
## 32. run of importance source...
## 33. run of importance source...
## 34. run of importance source...
## 35. run of importance source...
## 36. run of importance source...
## 37. run of importance source...
## 38. run of importance source...
## 39. run of importance source...
## 40. run of importance source...
## 41. run of importance source...
## 42. run of importance source...
## 43. run of importance source...
## 44. run of importance source...
## 45. run of importance source...
## 46. run of importance source...
## 47. run of importance source...
## 48. run of importance source...
## 49. run of importance source...
## 50. run of importance source...
## 51. run of importance source...
## 52. run of importance source...
## 53. run of importance source...
## 54. run of importance source...
## 55. run of importance source...
## 56. run of importance source...
## 57. run of importance source...
## 58. run of importance source...
## 59. run of importance source...
## 60. run of importance source...
## 61. run of importance source...
## 62. run of importance source...
## 63. run of importance source...
## 64. run of importance source...
## 65. run of importance source...
## 66. run of importance source...
## 67. run of importance source...
## 68. run of importance source...
## 69. run of importance source...
## 70. run of importance source...
## 71. run of importance source...
## 72. run of importance source...
## 73. run of importance source...
## 74. run of importance source...
## 75. run of importance source...
## 76. run of importance source...
## 77. run of importance source...
## 78. run of importance source...
## 79. run of importance source...
## 80. run of importance source...
## 81. run of importance source...
## 82. run of importance source...
## 83. run of importance source...
## 84. run of importance source...
## 85. run of importance source...
## 86. run of importance source...
## 87. run of importance source...
## 88. run of importance source...
## 89. run of importance source...
## 90. run of importance source...
## 91. run of importance source...
## 92. run of importance source...
## 93. run of importance source...
## 94. run of importance source...
## 95. run of importance source...
## 96. run of importance source...
## 97. run of importance source...
## 98. run of importance source...
## 99. run of importance source...
## [1] "loan_amnt" "int_rate" "grade"
## [4] "home_ownership" "annual_inc" "verification_status"
## [7] "purpose" "dti" "open_acc"
## [10] "revol_bal" "revol_util" "total_acc"
## [13] "total_rec_int" "tot_cur_bal" "total_rev_hi_lim"
With the Boruta method providing valuable insights, the next step in our
analysis involves a closer examination of multicollinearity.
Multicollinearity arises when one predictor variable in a multiple
regression model can be accurately predicted from the others. This
phenomenon can lead to skewed or misleading coefficient estimates and
draw potentially erroneous conclusions.
In essence, the presence of multicollinearity introduces a level of redundancy among predictor variables, making it challenging to isolate their individual effects on the response variable. When high correlations exist between predictors, it becomes difficult for the regression model to distinguish the unique contributions of each variable, potentially compromising the accuracy and reliability of the model.
Identifying and addressing multicollinearity is critical for refining the integrity of our regression analysis. It allows us to disentangle the interdependencies between predictor variables and ensures that each variable’s impact on the response variable is accurately reflected. By navigating the complexities of multicollinearity, we aim to enhance the robustness of our regression model, providing more accurate and trustworthy insights into the relationships between predictors and the target variable.
# Generate a bar plot for each categorical variable
categorical_vars <- c("home_ownership", "verification_status", "purpose", "addr_state", "application_type")
# Loop over categorical variables and plot using aes() and without aes_string()
for (cat_var in categorical_vars) {
# Check if the column exists to avoid errors
if (!cat_var %in% names(loan_sample_under)) {
message(paste("Skipping", cat_var, "as it is not found in the dataset."))
next
}
# Create the plot
plot <- loan_sample_under %>%
group_by(.data[[cat_var]], Status) %>%
summarise(Count = n(), .groups = 'drop') %>%
ggplot(aes(x = .data[[cat_var]], y = Count, fill = as.factor(Status))) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = paste("Bar plot of", cat_var, "by Status"), x = cat_var, y = "Count") +
scale_fill_discrete(name = "Status") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x labels for readability
# Print the plot
print(plot)
}
## Skipping addr_state as it is not found in the dataset.
Hier Text schreiben!
library(plotly)
# Assuming your data is in a dataframe called loan_sample and
# the relevant columns are named loan_amnt and annual_inc
plot_ly(data = loan_sample_under, x = ~loan_amnt, y = ~annual_inc,
type = 'scatter', mode = 'markers',
marker = list(size = 10, opacity = 0.5)) %>%
layout(title = 'Association between Loan Amount and Annual Income',
xaxis = list(title = 'Loan Amount Requested'),
yaxis = list(title = 'Annual Income of Borrower'))
As we examine the density plot, the regions with high density unveil common combinations of income and loan amounts, offering insights into potential standard loan products or characteristic borrower profiles. The clustering of data points in these areas indicates prevalent patterns that could be indicative of well-established loan products catering to specific income brackets or borrower demographics.
Notably, the observation of a wide range of incomes accompanied by a relatively narrow range of loan amounts prompts considerations about the sensitivity of loan amounts to income variations. This pattern may imply that, beyond a certain income threshold, the loan amount becomes less sensitive to further income increases. In other words, there seems to be a saturation point where increases in income may not significantly impact the loan amount, suggesting a potential cap or standardization in the lending process.
Understanding these density patterns not only sheds light on the typical borrower scenarios but also hints at the underlying dynamics between income and loan amounts. It invites further exploration into the factors influencing loan decisions and provides valuable insights for refining lending strategies or tailoring financial products to meet the diverse needs of different income segments.
Hier Text schreiben !!! ### 2.1: Dividing the sample into training and testing set
# Set seed for reproducibility
set.seed(7)
# Split the data into training (70%) and testing (30%) sets
splitIndex <- createDataPartition(loan_sample_under$Status, p = 0.7, list = FALSE)
training_set <- loan_sample_under[splitIndex,]
testing_set <- loan_sample_under[-splitIndex,]
Hier Text schreiben!
PercTable(loan_sample_under$Status)
##
## freq perc
##
## 0 5'191 50.0%
## 1 5'190 50.0%
PercTable(training_set$Status)
##
## freq perc
##
## 0 3'634 50.0%
## 1 3'633 50.0%
PercTable(testing_set$Status)
##
## freq perc
##
## 0 1'557 50.0%
## 1 1'557 50.0%
In the next step, we train the logit model. In terms of our inputs i.e. our Xs, we use all variables included in the data_new_under apart from the status, which is our Y. How would you interpret the results printed from the summary fit1?
##
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.502e+00 3.188e-01 -7.846 4.29e-15 ***
## loan_amnt 6.021e-05 6.042e-06 9.965 < 2e-16 ***
## int_rate 1.124e-01 1.531e-02 7.341 2.12e-13 ***
## gradeB 3.985e-01 1.049e-01 3.798 0.000146 ***
## gradeC 5.648e-01 1.367e-01 4.133 3.58e-05 ***
## gradeD 6.147e-01 2.014e-01 3.053 0.002268 **
## home_ownershipOWN 6.345e-03 9.334e-02 0.068 0.945806
## home_ownershipRENT 2.215e-01 6.796e-02 3.259 0.001119 **
## annual_inc -5.566e-06 1.183e-06 -4.706 2.52e-06 ***
## verification_statusSource Verified 1.267e-01 5.950e-02 2.130 0.033164 *
## verification_statusVerified 7.239e-02 6.820e-02 1.061 0.288522
## purposecredit_card -5.311e-02 2.425e-01 -0.219 0.826621
## purposedebt_consolidation -3.014e-02 2.385e-01 -0.126 0.899433
## purposehome_improvement 1.459e-01 2.608e-01 0.560 0.575775
## purposehouse -7.137e-01 4.187e-01 -1.705 0.088253 .
## purposemajor_purchase -1.038e-02 2.969e-01 -0.035 0.972113
## purposemedical 9.694e-02 3.387e-01 0.286 0.774748
## purposemoving -6.495e-01 3.769e-01 -1.723 0.084845 .
## purposeother -1.591e-01 2.552e-01 -0.623 0.533091
## purposerenewable_energy 9.113e-01 9.181e-01 0.993 0.320917
## purposesmall_business 3.667e-01 3.500e-01 1.048 0.294720
## purposevacation 8.011e-02 3.429e-01 0.234 0.815283
## purposewedding -1.836e+00 8.557e-01 -2.145 0.031959 *
## dti 1.412e-02 3.644e-03 3.875 0.000107 ***
## open_acc 4.295e-02 9.351e-03 4.593 4.36e-06 ***
## revol_bal -1.064e-05 7.926e-06 -1.342 0.179469
## revol_util 1.363e-03 1.988e-03 0.686 0.492961
## total_acc -8.441e-03 3.917e-03 -2.155 0.031175 *
## total_rec_int -2.410e-04 2.285e-05 -10.545 < 2e-16 ***
## application_typeJoint App 1.403e-01 2.058e-01 0.682 0.495248
## tot_cur_bal -7.800e-07 3.778e-07 -2.065 0.038950 *
## total_rev_hi_lim 1.483e-06 4.517e-06 0.328 0.742687
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10074.2 on 7266 degrees of freedom
## Residual deviance: 9186.6 on 7235 degrees of freedom
## AIC: 9250.6
##
## Number of Fisher Scoring iterations: 4
We can print out only the significant variables with p-value lower than 0.05. We notice that 9 variables are found statistically significant.
## [1] "loan_amnt" "int_rate"
## [3] "gradeB" "gradeC"
## [5] "gradeD" "home_ownershipRENT"
## [7] "annual_inc" "verification_statusSource Verified"
## [9] "purposewedding" "dti"
## [11] "open_acc" "total_acc"
## [13] "total_rec_int" "tot_cur_bal"
Next, we aim to evaluate the predictive performance of our model. To do so, we will plot the ROC curve.
testing_set$fit1_score <- predict(fit1,type='response',testing_set)
fit1_pred <- prediction(testing_set$fit1_score, testing_set$Status)
fit1_roc <- performance(fit1_pred, "tpr", "fpr")
plot(fit1_roc, lwd=1, colorize = TRUE, main = "Fit1: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)
In the subsequent step of our analysis, we delve into the intricacies of
our model by visualizing the Precision/Recall Curve. This curve serves
as a comprehensive illustration of the delicate balance between the true
positive rate and the positive predictive value across various
probability thresholds.
As we navigate through different probability thresholds, the Precision/Recall Curve provides a nuanced perspective on the model’s performance. It vividly captures the trade-offs inherent in our predictive model, offering valuable insights into how adjusting the probability threshold impacts both precision and recall.
Precision, representing the accuracy of positive predictions, and recall, measuring the model’s ability to capture all actual positive instances, are pivotal metrics in evaluating classifier performance. By exploring this curve, we gain a deeper understanding of the model’s behavior across a spectrum of probability thresholds, enabling us to make informed decisions about the optimal threshold for our specific use case.
This visual representation not only aids in model evaluation but also guides us in fine-tuning our predictive approach for achieving the desired balance between precision and recall, aligning with the specific objectives and requirements of our analysis.
fit1_precision <- performance(fit1_pred, measure = "prec", x.measure = "rec")
plot(fit1_precision, main="Fit1: Logit - Precision vs Recall")
Hier noch genau erklären und herausfinden warum es ein Sprung macht.
Erklärung von Chat GPT: Die Kurve kann Unregelmäßigkeiten in der
Verteilung der Klassifizierungswahrscheinlichkeiten widerspiegeln. Ein
kleiner Satz von Beispielen, die sehr sicher als positiv klassifiziert
werden, gefolgt von einer größeren Gruppe von Beispielen mit weniger
Sicherheit, könnte zu einem solchen Sprung führen.
# Extract the confusion matrix
cm <- confusionMatrix(as.factor(round(testing_set$fit1_score)), testing_set$Status)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 997 561
## 1 560 996
##
## Accuracy : 0.64
## 95% CI : (0.6229, 0.6569)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.28
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6403
## Specificity : 0.6397
## Pos Pred Value : 0.6399
## Neg Pred Value : 0.6401
## Prevalence : 0.5000
## Detection Rate : 0.3202
## Detection Prevalence : 0.5003
## Balanced Accuracy : 0.6400
##
## 'Positive' Class : 0
##
# AUC
fit1_auc <- performance(fit1_pred, measure = "auc")
# Overall accuracy
accuracy <- sum(diag(cm$table)) / sum(cm$table)
# Print of the values
cat("AUC: ", fit1_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy)
## AUC: 69.61645
## Overall Accuracy: 0.6400128
Thinking about the pre-processing steps that you carried out before training the logistic classifier: Can you think of a way to improve the predictive performance of your data?
# Set seed for reproducibility
set.seed(7)
# Splitting the data into testing and training data
splitIndex2 <- createDataPartition(loan_sample$Status, p = 0.7, list = FALSE)
training_set2 <- loan_sample[splitIndex2,]
testing_set2 <- loan_sample[-splitIndex2,]
# Percentage table of Status with out preprocessing
PercTable(loan_sample$Status)
##
## freq perc
##
## 0 34'810 87.0%
## 1 5'190 13.0%
PercTable(training_set2$Status)
##
## freq perc
##
## 0 24'367 87.0%
## 1 3'633 13.0%
PercTable(testing_set2$Status)
##
## freq perc
##
## 0 10'443 87.0%
## 1 1'557 13.0%
# Fitting the model
fit2 <- glm(Status ~ ., data=training_set2,family=binomial())
summary(fit2)
##
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.312e+00 2.390e-01 -18.041 < 2e-16 ***
## loan_amnt 6.705e-05 4.327e-06 15.496 < 2e-16 ***
## int_rate 1.032e-01 9.869e-03 10.452 < 2e-16 ***
## gradeB 3.546e-01 8.463e-02 4.190 2.79e-05 ***
## gradeC 6.056e-01 1.012e-01 5.984 2.18e-09 ***
## gradeD 6.676e-01 1.414e-01 4.721 2.35e-06 ***
## home_ownershipOWN 6.147e-02 6.938e-02 0.886 0.375618
## home_ownershipRENT 2.356e-01 5.079e-02 4.638 3.51e-06 ***
## annual_inc -5.235e-06 8.955e-07 -5.846 5.04e-09 ***
## verification_statusSource Verified 7.956e-02 4.464e-02 1.782 0.074712 .
## verification_statusVerified 7.838e-02 5.021e-02 1.561 0.118519
## purposecredit_card -6.694e-02 1.891e-01 -0.354 0.723338
## purposedebt_consolidation -6.613e-02 1.859e-01 -0.356 0.722028
## purposehome_improvement 1.180e-01 2.022e-01 0.584 0.559546
## purposehouse -2.357e-01 3.278e-01 -0.719 0.472001
## purposemajor_purchase 5.665e-03 2.253e-01 0.025 0.979936
## purposemedical 2.148e-01 2.394e-01 0.897 0.369624
## purposemoving -7.206e-01 3.088e-01 -2.334 0.019604 *
## purposeother 1.113e-02 1.971e-01 0.056 0.954971
## purposerenewable_energy 5.033e-01 5.596e-01 0.899 0.368417
## purposesmall_business 2.876e-01 2.463e-01 1.168 0.242874
## purposevacation 3.334e-01 2.538e-01 1.313 0.189048
## purposewedding -8.367e-02 5.370e-01 -0.156 0.876169
## dti 1.374e-02 2.605e-03 5.273 1.34e-07 ***
## open_acc 3.992e-02 6.667e-03 5.989 2.11e-09 ***
## revol_bal -6.500e-08 5.694e-06 -0.011 0.990891
## revol_util 1.746e-03 1.437e-03 1.215 0.224529
## total_acc -1.007e-02 2.862e-03 -3.519 0.000433 ***
## total_rec_int -2.345e-04 1.591e-05 -14.742 < 2e-16 ***
## application_typeJoint App 8.834e-02 1.401e-01 0.631 0.528206
## tot_cur_bal -5.014e-07 2.871e-07 -1.747 0.080667 .
## total_rev_hi_lim -6.259e-06 3.384e-06 -1.850 0.064351 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21611 on 27999 degrees of freedom
## Residual deviance: 19941 on 27968 degrees of freedom
## AIC: 20005
##
## Number of Fisher Scoring iterations: 5
# We can print out only the significant variables with p-value lower than 0.05. We notice that 9 variables are found statistically significant.
significant.variables2 <- summary(fit2)$coeff[-1,4] < 0.05
names(significant.variables2)[significant.variables2 == TRUE]
## [1] "loan_amnt" "int_rate" "gradeB"
## [4] "gradeC" "gradeD" "home_ownershipRENT"
## [7] "annual_inc" "purposemoving" "dti"
## [10] "open_acc" "total_acc" "total_rec_int"
testing_set2$fit2_score <- predict(fit2,type='response',testing_set2)
fit2_pred <- prediction(testing_set2$fit2_score, testing_set2$Status)
fit2_roc <- performance(fit2_pred, "tpr", "fpr")
plot(fit2_roc, lwd=1, colorize = TRUE, main = "Fit2: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)
fit2_precision <- performance(fit2_pred, measure = "prec", x.measure = "rec")
plot(fit2_precision, main="Fit2: Logit - Precision vs Recall")
# Extract the confusion matrix
cm2 <- confusionMatrix(as.factor(round(testing_set2$fit2_score)), testing_set2$Status)
print(cm2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10414 1534
## 1 29 23
##
## Accuracy : 0.8698
## 95% CI : (0.8636, 0.8757)
## No Information Rate : 0.8702
## P-Value [Acc > NIR] : 0.5714
##
## Kappa : 0.0204
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99722
## Specificity : 0.01477
## Pos Pred Value : 0.87161
## Neg Pred Value : 0.44231
## Prevalence : 0.87025
## Detection Rate : 0.86783
## Detection Prevalence : 0.99567
## Balanced Accuracy : 0.50600
##
## 'Positive' Class : 0
##
# AUC
fit2_auc <- performance(fit2_pred, measure = "auc")
# Overall accuracy
accuracy2 <- sum(diag(cm2$table)) / sum(cm2$table)
# Print of the values
cat("AUC: ", fit2_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy2)
## AUC: 69.35202
## Overall Accuracy: 0.86975
# Set seed for reproducibility
set.seed(7)
# Deleting column application_type
loan_sample_under_new <- loan_sample_under[-14]
# Splitting the data into testing and training data
splitIndex3 <- createDataPartition(loan_sample_under_new$Status, p = 0.7, list = FALSE)
training_set3 <- loan_sample_under_new[splitIndex3,]
testing_set3 <- loan_sample_under_new[-splitIndex3,]
# Percentage table of Status with out preprocessing
PercTable(loan_sample_under_new$Status)
##
## freq perc
##
## 0 5'191 50.0%
## 1 5'190 50.0%
PercTable(training_set3$Status)
##
## freq perc
##
## 0 3'634 50.0%
## 1 3'633 50.0%
PercTable(testing_set3$Status)
##
## freq perc
##
## 0 1'557 50.0%
## 1 1'557 50.0%
# Fitting the model
fit3 <- glm(Status ~ ., data=training_set3,family=binomial())
summary(fit3)
##
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.503e+00 3.188e-01 -7.852 4.09e-15 ***
## loan_amnt 6.064e-05 6.010e-06 10.090 < 2e-16 ***
## int_rate 1.127e-01 1.530e-02 7.364 1.78e-13 ***
## gradeB 3.981e-01 1.049e-01 3.794 0.000148 ***
## gradeC 5.639e-01 1.367e-01 4.127 3.68e-05 ***
## gradeD 6.123e-01 2.014e-01 3.041 0.002358 **
## home_ownershipOWN 5.751e-03 9.334e-02 0.062 0.950871
## home_ownershipRENT 2.216e-01 6.796e-02 3.260 0.001114 **
## annual_inc -5.612e-06 1.181e-06 -4.752 2.01e-06 ***
## verification_statusSource Verified 1.261e-01 5.949e-02 2.120 0.034004 *
## verification_statusVerified 7.273e-02 6.820e-02 1.067 0.286189
## purposecredit_card -5.463e-02 2.424e-01 -0.225 0.821709
## purposedebt_consolidation -3.023e-02 2.385e-01 -0.127 0.899133
## purposehome_improvement 1.451e-01 2.608e-01 0.557 0.577858
## purposehouse -7.130e-01 4.189e-01 -1.702 0.088708 .
## purposemajor_purchase -1.169e-02 2.968e-01 -0.039 0.968574
## purposemedical 1.040e-01 3.384e-01 0.307 0.758693
## purposemoving -6.480e-01 3.768e-01 -1.720 0.085432 .
## purposeother -1.589e-01 2.552e-01 -0.623 0.533444
## purposerenewable_energy 9.099e-01 9.179e-01 0.991 0.321555
## purposesmall_business 3.653e-01 3.499e-01 1.044 0.296470
## purposevacation 7.889e-02 3.429e-01 0.230 0.818059
## purposewedding -1.838e+00 8.560e-01 -2.147 0.031786 *
## dti 1.429e-02 3.635e-03 3.931 8.45e-05 ***
## open_acc 4.268e-02 9.342e-03 4.568 4.91e-06 ***
## revol_bal -1.072e-05 7.925e-06 -1.353 0.176182
## revol_util 1.342e-03 1.988e-03 0.675 0.499482
## total_acc -8.411e-03 3.917e-03 -2.147 0.031764 *
## total_rec_int -2.420e-04 2.280e-05 -10.615 < 2e-16 ***
## tot_cur_bal -7.642e-07 3.770e-07 -2.027 0.042670 *
## total_rev_hi_lim 1.499e-06 4.516e-06 0.332 0.739938
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10074.2 on 7266 degrees of freedom
## Residual deviance: 9187.1 on 7236 degrees of freedom
## AIC: 9249.1
##
## Number of Fisher Scoring iterations: 4
# We can print out only the significant variables with p-value lower than 0.05. We notice that 9 variables are found statistically significant.
significant.variables3 <- summary(fit3)$coeff[-1,4] < 0.05
names(significant.variables3)[significant.variables3 == TRUE]
## [1] "loan_amnt" "int_rate"
## [3] "gradeB" "gradeC"
## [5] "gradeD" "home_ownershipRENT"
## [7] "annual_inc" "verification_statusSource Verified"
## [9] "purposewedding" "dti"
## [11] "open_acc" "total_acc"
## [13] "total_rec_int" "tot_cur_bal"
testing_set3$fit3_score <- predict(fit3,type='response',testing_set3)
fit3_pred <- prediction(testing_set3$fit3_score, testing_set3$Status)
fit3_roc <- performance(fit3_pred, "tpr", "fpr")
plot(fit3_roc, lwd=1, colorize = TRUE, main = "Fit3: Logit - ROC Curve")
lines(x=c(0, 1), y=c(0, 1), col="black", lwd=1, lty=3)
fit3_precision <- performance(fit3_pred, measure = "prec", x.measure = "rec")
plot(fit3_precision, main="Fit3: Logit - Precision vs Recall")
# Extract the confusion matrix
cm3 <- confusionMatrix(as.factor(round(testing_set3$fit3_score)), testing_set3$Status)
print(cm3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1001 560
## 1 556 997
##
## Accuracy : 0.6416
## 95% CI : (0.6245, 0.6585)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.2832
##
## Mcnemar's Test P-Value : 0.9284
##
## Sensitivity : 0.6429
## Specificity : 0.6403
## Pos Pred Value : 0.6413
## Neg Pred Value : 0.6420
## Prevalence : 0.5000
## Detection Rate : 0.3215
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.6416
##
## 'Positive' Class : 0
##
# AUC
fit3_auc <- performance(fit3_pred, measure = "auc")
# Overall accuracy
accuracy3 <- sum(diag(cm3$table)) / sum(cm3$table)
# Print of the values
cat("AUC: ", fit3_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy3)
## AUC: 69.62428
## Overall Accuracy: 0.6416185
Finally, thinking about putting your model into action and basing credit decisions on the prediction that it generates: • What kind of challenges may a company face if it would use your model in their daily business, in particular in regard to ethical challenges and moral obligations companies have? Please refer to the „common ethical issues in the context the creation of value from data” (see slides week 11) in your answer. • Can you think of a way how companies can overcome or at least mitigate the issues that you described above?